* Apache Ignite PMC

* Apache Kafka contributor 

* https://t.me/db links 

* https://t.me/nizhikov Talks
* https://github.com/nizhikov 


* nizhikov@apache.org 


Talk structure


* B-tree (B+tree, B*tree, B-link tree, etc.) design principles.
* Basic tree operation implementation.
* Concurrent tree design.


B-tree design principles


There are two core design limitations: 
* Hardware. 


* Security Officer. 
Database features:


* Store data on disk, unless the database is in-memory.
* Search:
  * Lookup.
  * Bounded iteration (the BETWEEN x AND y clause).


Database storage basics 


* The storage unit is a page.
* The page size is aligned with the disk storage unit.
* Pages are combined in reusable buffers.
* There are different page types.
* The WAL (write-ahead log) stores record deltas.
* Pages are written as a whole periodically, during a checkpoint or a similar process.

[Figure: RAM with Buffer #1 and Buffer #2 holding table pages.]


Binary tree (AVL or Red-Black)


* Good concurrency.
* Easy to implement.
* Good performance:
  * Search - O(log n).
  * Insert - O(log n).
  * Remove - O(log n).


Binary tree (AVL or Red-Black)


* Bad hardware mapping. 


Tree properties 


* All leaves are at the same level.
* The root has at least two children.
* Each node except the root has at most m children and at least n children.
* Each node contains at most m - 1 keys and at least n - 1 keys.


Advantages


* Small tree height in practice.
* Good performance:
  * All leaves are on the same level, so search performance is the same for every key.
  * Insertion and removal rarely modify more than one page.
* Hardware-friendly structure.
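A rough estimate shows why the height stays small. The sketch below assumes every node has the same fan-out and counts how many levels are needed to reach a given number of keys. The class name `BTreeHeight`, the method `levels`, and the fan-out of 256 are illustrative assumptions, not figures from the talk.

```java
final class BTreeHeight {
    /**
     * Smallest number of levels such that fanout^levels >= keys.
     * A rough estimate: it assumes every node is full.
     */
    static int levels(long keys, int fanout) {
        if (keys < 1 || fanout < 2) {
            throw new IllegalArgumentException("keys >= 1 and fanout >= 2 expected");
        }
        int levels = 1;
        long capacity = fanout; // keys reachable through a single level
        while (capacity < keys) {
            if (capacity > Long.MAX_VALUE / fanout) {
                return levels + 1; // the next level exceeds any long value
            }
            capacity *= fanout;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(levels(1_000_000_000L, 2));   // binary tree
        System.out.println(levels(1_000_000_000L, 256)); // hypothetical page fan-out
    }
}
```

Under these assumptions, a billion keys need 30 levels with a fan-out of 2 and only 4 levels with a fan-out of 256, which is why a page-sized node maps so much better to disk reads.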


Example 


* Each node size is aligned with the disk write unit: 2 KB, 4 KB, etc.
* Several keys are stored in each node.
* An inner node stores tuples (key, page-link).
* Only a leaf node stores actual data: (key, row-link).
* Keys are sorted inside a node.

[Figure: example tree with inner nodes and leaf nodes.]


Tree operations 


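Search (contains)


[Figure: searching for key = 50. The key is compared with the keys in each node on the way down (for example, 42 > 50?) until a leaf is reached. Result: contains = false.]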
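The descent can be sketched in a few lines. This is a minimal single-threaded sketch, not the implementation from any particular database: the `Node` class, the routing rule (child i holds keys smaller than separator i), and the sample tree are assumptions made for illustration.

```java
import java.util.Arrays;

final class BPlusSearch {
    /** Sorted keys; an inner node also holds keys.length + 1 child links. */
    static final class Node {
        final int[] keys;
        final Node[] children; // null for a leaf

        Node(int[] keys, Node... children) {
            this.keys = keys;
            this.children = children.length == 0 ? null : children;
        }

        boolean isLeaf() {
            return children == null;
        }
    }

    /** Walks from the root to a leaf, then looks for the key inside the leaf. */
    static boolean contains(Node root, int key) {
        Node node = root;
        while (!node.isLeaf()) {
            int i = 0;
            // Skip every separator that is less than or equal to the key.
            while (i < node.keys.length && key >= node.keys[i]) {
                i++;
            }
            node = node.children[i];
        }
        return Arrays.binarySearch(node.keys, key) >= 0;
    }

    public static void main(String[] args) {
        Node root = new Node(new int[] {42, 60},
            new Node(new int[] {10, 20}),
            new Node(new int[] {42, 48}),
            new Node(new int[] {60, 75}));
        System.out.println(contains(root, 50)); // false
        System.out.println(contains(root, 48)); // true
    }
}
```

Only one node per level is touched, so the number of page reads equals the tree height.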


Insertion - happy path


[Figure: inserting key = 50. The descent is the same as for search (for example, 42 > 50?), and the key is placed into the leaf.]


Insertion with split


[Figure: inserting key = 55 with a page split.]


Removal - happy path 


Removal with merge 


Removal - leftmost item 


Searching key inside tree node 


* If the keys have a constant size, use binary search.
* Otherwise, just iterate over the keys one by one.
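The difference comes from addressing: with fixed-size keys the offset of key i can be computed, with variable-size keys it cannot. A minimal sketch of both cases follows. The page layouts (8-byte keys stored back to back, and entries prefixed with a 2-byte length) are assumptions chosen for illustration, not a real on-disk format.

```java
import java.nio.ByteBuffer;

final class InNodeSearch {
    /** Fixed-size keys: sorted 8-byte longs. Returns the key index or -1. */
    static int findFixed(ByteBuffer page, int count, long key) {
        int lo = 0;
        int hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long k = page.getLong(mid * Long.BYTES); // offset is computable
            if (k < key) {
                lo = mid + 1;
            } else if (k > key) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    /** Variable-size keys: [2-byte length][bytes] entries. Returns the index or -1. */
    static int findVariable(ByteBuffer page, int count, byte[] key) {
        int off = 0;
        for (int i = 0; i < count; i++) {
            int len = page.getShort(off) & 0xFFFF;
            off += Short.BYTES;
            if (len == key.length && matches(page, off, key)) {
                return i;
            }
            off += len; // the next entry starts right after this one
        }
        return -1;
    }

    private static boolean matches(ByteBuffer page, int off, byte[] key) {
        for (int j = 0; j < key.length; j++) {
            if (page.get(off + j) != key[j]) {
                return false;
            }
        }
        return true;
    }
}
```

Both methods use absolute reads, so they do not change the buffer position.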


Concurrent tree design 


Concurrency issues


* Several threads modify the same page.
* Concurrent operations modify the same nodes.
* Basic operation implementations propagate bottom-up.


T1 - search 75


T2 - inserts 88


T1 - reads page.


Naive approach - lock tree path 




Improved naive approach - optimistic descent


* Read locks during traversal.
* Write lock on the leaf.
* Check for a split/merge.
  * If yes, restart with write locking the entire subtree.


Lock model 


* Each page has a regular read-write lock.
* A thread that holds the read lock is guaranteed that the page will not be modified concurrently.
* Any number of threads can hold the read lock concurrently.
* The write lock is exclusive.
* While the write lock is held, no other thread reads or writes the page.
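In Java this model maps directly onto `java.util.concurrent.locks.ReentrantReadWriteLock`. The sketch below is only an illustration of the rules above: the `LockedPage` class and its single `long` payload are hypothetical stand-ins for a real page.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.LongUnaryOperator;

final class LockedPage {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long value; // stands in for the page content

    /** Many threads may read at the same time. */
    long read() {
        lock.readLock().lock();
        try {
            return value;
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Only one thread modifies the page, and no reader sees it meanwhile. */
    void update(LongUnaryOperator change) {
        lock.writeLock().lock();
        try {
            value = change.applyAsLong(value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

A caller would use it as `page.update(v -> v + 1)` and `page.read()`. Note that `ReentrantReadWriteLock` does not allow upgrading a held read lock to a write lock, so the read lock has to be released first.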


B-link - pointer to the right sibling 
T1 - search 75


T2 - inserts 88 
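The right-sibling link lets a reader recover when a page was split after the reader obtained a pointer to it. The sketch below follows the idea from the Lehman and Yao paper: each page keeps a high key, and a reader that looks for a larger key moves right. It is single-threaded and leaves out locking; the `BLinkLeaf` class, its fields, and the sample keys are assumptions for illustration.

```java
import java.util.Arrays;

final class BLinkLeaf {
    int[] keys;      // sorted keys of this page
    int highKey;     // largest key this page may hold
    BLinkLeaf right; // B-link: pointer to the right sibling

    BLinkLeaf(int[] keys, int highKey, BLinkLeaf right) {
        this.keys = keys;
        this.highKey = highKey;
        this.right = right;
    }

    /** Moves the upper half of the keys (at least two keys expected) to a new right sibling. */
    BLinkLeaf split() {
        int mid = keys.length / 2;
        BLinkLeaf sibling =
            new BLinkLeaf(Arrays.copyOfRange(keys, mid, keys.length), highKey, right);
        keys = Arrays.copyOf(keys, mid);
        highKey = keys[mid - 1];
        right = sibling;
        return sibling;
    }

    /** Search that tolerates a stale page pointer by following right links. */
    static boolean contains(BLinkLeaf start, int key) {
        BLinkLeaf page = start;
        while (key > page.highKey && page.right != null) {
            page = page.right; // the key was moved by a split
        }
        return Arrays.binarySearch(page.keys, key) >= 0;
    }
}
```

If T1 holds a pointer to a page with keys 60, 70, 75, 80 and the page is split before T1 reads it, `contains` still finds 75 by moving to the new sibling, without restarting from the root.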


Split algorithm - insert 12


1. Take write lock.
2. Split page.
3. Take write lock on parent.
4. Adjust pointers.


Merge algorithm - remove 8


1. Find an item and take locks.
2. Merge pages from right to left.
3. Link the empty page to the left predecessor.
4. Update the parent.
Range scan: x > 20 AND x < 40


Range scan: x < 35
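With right-sibling links a range scan is a walk over the leaf level. The sketch below is single-threaded and assumes the scan already stands on the leaf where the range starts. The `Leaf` class and the sample keys are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class LeafScan {
    static final class Leaf {
        final int[] keys; // sorted
        Leaf right;       // right sibling, null for the last leaf

        Leaf(int... keys) {
            this.keys = keys;
        }
    }

    /** Collects keys with lo < key < hi, following right-sibling links. */
    static List<Integer> scan(Leaf first, int lo, int hi) {
        List<Integer> out = new ArrayList<>();
        for (Leaf leaf = first; leaf != null; leaf = leaf.right) {
            for (int k : leaf.keys) {
                if (k >= hi) {
                    return out; // keys are sorted, nothing further can match
                }
                if (k > lo) {
                    out.add(k);
                }
            }
        }
        return out;
    }
}
```

For leaves 10, 20 | 25, 30 | 35, 40, 50 the scan for x > 20 AND x < 40 returns 25, 30, 35. A scan with only an upper bound, such as x < 35, starts from the leftmost leaf.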
Inverted order scan - how to implement ORDER BY x DESC?


* Just make two indexes with opposite orders:
  * More space.
  * Easy to implement and maintain.
* Use a left sibling link:
  * Less space.
  * Difficulties while scanning and maintaining.


Links 


* Efficient Locking for Concurrent Operations on B-Trees. Lehman, Yao - https://dl.acm.org/doi/pdf/10.1145/319628.319663
* A Symmetric Concurrent B-Tree Algorithm. Lanin, Shasha - https://dl.acm.org/doi/pdf/10.5555/324493.324589
* Postgres nbtree docs - https://github.com/postgres/postgres/tree/master/src/backend/access/nbtree
* Ignite BPlusTree implementation - https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/tree/BPlusTree.java


Thank you! 


